Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy

Authors

  • Dean P. Foster
  • Robert A. Stine
Abstract

We predict the onset of personal bankruptcy using least squares regression. Although bankruptcy is well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision-theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5.

Key Phrases: AIC, Cp, Bonferroni, calibration, hard thresholding, risk inflation criterion (RIC), stepwise regression, step-down testing.
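The figures quoted in the abstract can be turned into a few derived quantities. The Python sketch below is illustrative arithmetic only, not the authors' procedure. It computes the recall, precision, and lift of the top 14,000 predictions, and the misclassification cost of flagging exactly those cases under the 200:1 cost ratio. It also computes two hard-thresholding cutoffs on |t| for 67,000 candidate predictors. Several points are assumptions rather than statements from the paper: the function names are hypothetical; the 14,000 cutoff is not necessarily the cost-minimizing one; the Bonferroni level `alpha = 0.05` and the normal approximation are chosen for illustration; and the RIC cutoff is taken in its usual form, sqrt(2 log p).

```python
import math
from statistics import NormalDist

# Figures quoted in the abstract.
N_VALIDATION = 2_300_000   # validation observations
N_BANKRUPT = 1_800         # bankruptcies in the validation sample
N_FLAGGED = 14_000         # largest predictions examined
N_CAUGHT = 1_000           # bankruptcies among the flagged cases
COST_RATIO = 200           # cost of a miss relative to a false positive
N_PREDICTORS = 67_000      # "over 67,000", so treat this as a lower bound


def lift_summary(n_total, n_pos, n_flagged, n_caught):
    """Recall, precision, and lift from flagging the top-ranked cases."""
    recall = n_caught / n_pos
    precision = n_caught / n_flagged
    base_rate = n_pos / n_total
    return {"recall": recall, "precision": precision,
            "lift": precision / base_rate}


def misclassification_cost(n_pos, n_flagged, n_caught, cost_ratio):
    """Total cost, measured in units of one false positive."""
    false_positives = n_flagged - n_caught
    misses = n_pos - n_caught
    return false_positives + cost_ratio * misses


def selection_thresholds(n_predictors, alpha=0.05):
    """|t| cutoffs: RIC sqrt(2 log p) and two-sided Bonferroni.

    alpha and the normal approximation are assumptions for illustration.
    """
    ric = math.sqrt(2 * math.log(n_predictors))
    bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_predictors))
    return ric, bonferroni


summary = lift_summary(N_VALIDATION, N_BANKRUPT, N_FLAGGED, N_CAUGHT)
cost_flag_top = misclassification_cost(N_BANKRUPT, N_FLAGGED, N_CAUGHT, COST_RATIO)
cost_flag_none = misclassification_cost(N_BANKRUPT, 0, 0, COST_RATIO)
ric_cut, bonf_cut = selection_thresholds(N_PREDICTORS)

print(summary)
print(cost_flag_top, cost_flag_none)
print(ric_cut, bonf_cut)
```

Both cutoffs sit well above the conventional |t| > 2, which is one way to see why separating coincidental from useful predictors is hard when the candidate pool is this large.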


Similar resources

Predicting Bankruptcy of Companies using Data Mining Models and Comparing the Results with Z Altman Model

One factor that supports investment decisions is the availability of appropriate tools and models for evaluating the financial situation of an organization. With these tools, investors can analyze an organization's financial situation, identify financial distress or a healthy condition, and decide to invest under appropriate conditions. The main objective of this study is to ev...


Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy A01-028-R2

We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors of these from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predicto...


Applying Variable Deletion Strategies in Bankruptcy Studies to Capture Common Information and Increase Their Reality

In financial distress studies, the selection of variables is commonly based on the success of variables in variable sets employed in earlier bankruptcy studies, suggestions in the literature, or an accompanying data reduction in a large set of variables. If seemingly different variable sets exhibit a strong relationship, then heterogeneous variable sets capture common information. Canonical correlation anal...


Variable selection and corporate bankruptcy forecasts

We investigate the relative importance of various bankruptcy predictors commonly used in the existing literature by applying a variable selection technique, the least absolute shrinkage and selection operator (LASSO), to a comprehensive bankruptcy database. Over the 1980 to 2009 period, LASSO admits the majority of Campbell, Hilscher, and Szilagyi’s (2008) predictive variables into the bankrupt...


Variable Selection Method Affects SVM Approach in Bankruptcy Prediction

This paper examined how variable selection influences the bankruptcy predictive accuracy of five statistics models: discriminant analysis, logistic regression, probit regression, neural networks, support vector machine (SVM), and genetic-based SVM (GA-SVM). Empirical results indicate that the SVM-based models are very promising models for predicting financial failure, in terms of both best predic...



Publication date: 2001